Abstract

This paper proposes a learning-based approach to scene parsing inspired by the deep Recursive Context Propagation Network (RCPN). RCPN is a deep feed-forward neural network that utilizes the contextual information from the entire image, through bottom-up followed by top-down context propagation via random binary parse trees. This improves the feature representation of every super-pixel in the image for better classification into semantic categories. We analyze RCPN and propose two novel contributions to further improve the model. We first analyze the learning of RCPN parameters and discover the presence of bypass error paths in the computation graph of RCPN that can hinder contextual propagation. We propose to tackle this problem by including the classification loss of the internal nodes of the random parse trees in the original RCPN loss function. Secondly, we use an MRF on the parse tree nodes to model the hierarchical dependency present in the output. Both modifications provide performance boosts over the original RCPN and the new system achieves state-of-the-art performance on the Stanford Background, SIFT Flow and Daimler Urban datasets.

1. Introduction 

Semantic segmentation refers to the problem of labeling every pixel in an image with the correct semantic category. Handling the immense variability in the appearance of semantic categories requires the use of context to achieve human-level accuracy, as shown, for example, by [26, 15, 14]. Specifically, [15, 14] found that human performance in labeling a super-pixel is worse than that of a computer when both have access to that super-pixel only. Effectively using context presents a significant challenge, especially when a real-time solution is required.

An elegant deep recursive neural network approach for semantic segmentation, referred to as RCPN, was proposed in [21]. The main idea was to facilitate the propagation of contextual information from each super-pixel to every other super-pixel through random binary parse trees. First, a semantic mapper mapped visual features of the super-pixels into a semantic space. This was followed by a recursive combination of semantic features of two adjacent image regions, using a combiner, to yield the holistic feature vector of the entire image, termed the root feature. Next, the global information contained in the root feature was disseminated to every super-pixel in the image, using a decombiner, followed by classification of each super-pixel via a categorizer. The parameters were learned by minimizing the classification loss of the super-pixels by back-propagation through structure [6]. RCPN was shown to outperform recent approaches in terms of per-pixel accuracy (PPA) and mean-class accuracy (MCA). Most interestingly, it was almost two orders of magnitude faster than competing algorithms.

RCPN's speed and state-of-the-art performance motivate us to carefully analyze it. In this paper we show that it still has some weaknesses and how to remedy them. In particular, the direct path from the semantic mapper to the categorizer gives rise to bypass errors that can cause RCPN to bypass the combiner and decombiner assembly. This can cause back-propagation to reduce RCPN to a simple multi-layer neural network for each super-pixel. We propose two modifications to RCPN that overcome this problem:

1. Pure-node RCPN - We improve the loss function by adding the classification loss of those internal nodes of the random parse trees that correspond to a single semantic category, referred to as pure-nodes. This serves three purposes: a) it provides more labels for training, which results in better generalization; b) it encourages stronger gradients deep in the network; c) it tackles the problem of bypass errors, resulting in better use of contextual information.

2. Tree MRF RCPN - Pure-node RCPN also provides us with reliable estimates of the internal node label distributions. We utilize the label distribution of the internal nodes to define a tree-style MRF on the parse tree to model the hierarchical dependency between the nodes.

The resulting architectures provide promising improvements over the previous state-of-the-art on three semantic segmentation datasets: Stanford Background [7], SIFT Flow [12] and Daimler Urban [17].

The next section describes related work, followed by a brief overview of RCPN in Sec. 3. We describe our proposed methods in Sec. 4, followed by experiments in Sec. 5. Finally, we conclude in Sec. 6.

2. Related Work 

Previous work on semantic segmentation roughly follows two major themes: learning-based and non-parametric models.

Learning-based models learn the appearance of semantic categories, under various transformations, and the relations among them using parametric models. CRF-based image models have been quite successful in jointly modeling the appearance and structure of an image; [7, 16, 15, 14] use CRFs to combine unary potentials obtained from the visual features of super-pixels with the neighborhood constraints. The differences among these approaches are mainly in terms of the visual features, the form of the N-ary potentials and the CRF modeling. A joint-CRF on multiple levels of an image segmentation hierarchy is formulated in [11]. It achieves better results than a flat-CRF owing to the utilization of higher order contextual information coming in the form of a segmentation hierarchy. Multi-scale convolutional neural networks are used in [2] to learn visual feature extractors from raw-image/label training pairs. This approach achieved impressive results on various datasets using gPb, purity-cover and CRF on top of the learned features. It was extended in [18] by feeding the per-pixel labels predicted by a CNN classifier to the next stage of the same CNN classifier. However, the propagation structure is not adaptive to the image content, and propagating only label information did not improve much over the prior work.

A type of learning-based model was proposed in [23] that aims at learning a mapping from the visual features to a semantic space followed by classification. The semantic mapping is learned by optimizing a structure prediction cost on the ground-truth parse trees of training images with the hope that such training would embed the visual features in a semantically meaningful space, where classification would be easier. However, our experiments using the code provided by the authors show that semantic space mapping is actually no better than a simple 2-layer neural network on the visual features directly.

Recently, many successful non-parametric approaches for natural scene parsing have been proposed [25, 12, 22, 4, 24, 27]. These approaches are instances of sophisticated template matching to retrieve images that are visually similar to the query, from a database of labeled images. The matching step is followed by super-pixel label transfer from the retrieved images to the query image. Finally, a structured prediction model such as CRF is used to jointly utilize the unary potentials with plausible image models. These approaches differ in terms of the retrieval of candidate images or super-pixels, the transfer of labels from the retrieved candidates to the query image, and the form of the structured prediction model. These approaches are based on nearest-neighbor retrieval, which introduces a critical performance/accuracy trade-off. Theoretically, these approaches can utilize a huge amount of data with ever increasing accuracy. But a very large database would require a large retrieval time, which limits the scalability of these methods.

3. Background Material 

In this section, we provide a brief overview of the RCPN-based semantic segmentation framework; please refer to [21] for details.

3.1. Overview 

RCPN formulates the problem of semantic segmentation as labeling each super-pixel into desired semantic categories. The complete pipeline starting from the input image to the final pixel-wise labels is shown in Fig. 1. It starts with the super-segmentation of the image followed by the extraction of visual features for each super-pixel; [21] used the Multi-scale CNN [2] to extract per-pixel features that are then averaged over super-pixels. RCPN then constructs random binary parse trees using the adjacency information between super-pixels. The leaf nodes correspond to the initial super-pixels, and successive random mergers of two adjacent super-pixels build the internal nodes up to the root node, which corresponds to the entire image. The super-pixel features along with a parse tree are passed through an assembly of four modules (semantic mapper, combiner, decombiner and categorizer, in order) that outputs labels for each super-pixel. Multiple random parse trees can be used, both during training and testing. At test time, each parse tree can give rise to different labels for the same super-pixel; therefore, voting is used to decide the final label.
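The random merging of adjacent regions described above can be sketched as follows. This is a minimal illustration, not the procedure of [21]: the function name `random_parse_tree`, the edge-list input format and the uniform choice among adjacent pairs are assumptions made for the example, and no attempt is made to keep the tree balanced.

```python
import random

def random_parse_tree(num_leaves, adjacency, seed=0):
    """Build a random binary parse tree by repeatedly merging two adjacent regions.

    num_leaves: number of super-pixels (leaf ids are 0 .. num_leaves - 1).
    adjacency:  iterable of (i, j) pairs of adjacent super-pixels.
    Returns a dict parent_id -> (child_a, child_b); the largest id is the root.
    """
    rng = random.Random(seed)
    neighbors = {i: set() for i in range(num_leaves)}
    for i, j in adjacency:
        neighbors[i].add(j)
        neighbors[j].add(i)

    children = {}
    next_id = num_leaves
    while len(neighbors) > 1:
        # All pairs of currently adjacent regions (sorted for reproducibility).
        edges = sorted((a, b) for a in neighbors for b in neighbors[a] if a < b)
        if not edges:
            raise ValueError("adjacency graph is not connected")
        a, b = rng.choice(edges)
        # The merged region inherits the neighbors of both children.
        merged = (neighbors.pop(a) | neighbors.pop(b)) - {a, b}
        for n in merged:
            neighbors[n] -= {a, b}
            neighbors[n].add(next_id)
        neighbors[next_id] = merged
        children[next_id] = (a, b)
        next_id += 1
    return children

# Five super-pixels arranged in a chain: 0 - 1 - 2 - 3 - 4.
tree = random_parse_tree(5, [(0, 1), (1, 2), (2, 3), (3, 4)], seed=1)
```

A tree over S leaves built this way always has S - 1 internal nodes, and different seeds give the different random trees used for training and for test-time voting.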

Notation: Throughout this article, $v_i$ denotes the visual features of the $i$-th super-pixel, $x_i$ denotes the semantic feature of the $i$-th super-pixel and $\tilde{x}_i$ denotes the enhanced super-pixel features.

Semantic mapper is a neural network that maps the visual features of each super-pixel to a $d_{sem}$-dimensional semantic feature

$$x_i = F_{sem}(v_i; W_{sem}) \qquad (1)$$

here, $F_{sem}$ is the network and $W_{sem}$ are the layer weights.

Figure 1: Complete flow diagram of RCPN for semantic segmentation.

Combiner: Combiner is a neural network that recursively maps two child node features ($x_i$ and $x_j$) to their parent feature ($x_{i,j}$). Intuitively, the combiner network attempts to aggregate the semantic content of the children features such that the parent's features become representative of the children. The root features represent the entire image.

$$x_{i,j} = F_{com}([x_i, x_j]; W_{com}) \qquad (2)$$

here, $F_{com}$ is the network and $W_{com}$ are the layer weights.

Decombiner is a neural network that recursively disseminates the context information from a parent node to its children through the parse tree. This network maps the semantic features of the child node and its parent to the contextually enhanced feature of the child node. This top-down contextual propagation starts from the root feature and the decombiner is applied recursively down to the enhanced super-pixel features. Therefore, it is expected that every super-pixel feature contains the contextual information aggregated from the entire image.

$$\tilde{x}_i = F_{dec}([x_i, \tilde{x}_{i,j}]; W_{dec}) \qquad (3)$$

here, $F_{dec}$ is the network and $W_{dec}$ are the layer weights.

Categorizer is the final network, which maps the context-enhanced semantic features ($\tilde{x}_i$) of each super-pixel to one of the semantic category labels; it is a Softmax classifier

$$y_i = F_{cat}(\tilde{x}_i; W_{cat}) \qquad (4)$$

Together, all the parameters of RCPN are denoted as $W_{rcpn} = \{W_{sem}, W_{com}, W_{dec}, W_{cat}\}$. Let us assume there are $S$ super-pixels in an image $I$ and denote a set of $R$ random parse trees of $I$ as $T$. Then, the loss function for $I$ is

$$\mathcal{L}(I) = \frac{1}{RS} \sum_{r=1}^{R} \sum_{s=1}^{S} L(y_{r,s}, t_s; T_r, W_{rcpn}) \qquad (5)$$

here, $y_{r,s}$ is the predicted class-probability vector and $t_s$ is the ground-truth label for the $s$-th super-pixel for random parse tree $T_r$, and $L(y_{r,s}, t_s)$ is the cross-entropy loss function. Network parameters, $W_{rcpn}$, are learned by minimizing $\mathcal{L}(I)$ for all the images in the training data.
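The four modules of Eqs. (1) to (4) and the cross-entropy loss for a single parse tree can be put together in a short forward-pass sketch. It is only an illustration under stated assumptions: each module is a single layer, the tanh non-linearity is assumed (the text does not specify one), the child feature is placed before the parent feature in the decombiner input, the weights are random, and all function names are hypothetical.

```python
import math
import random

def dense(W, x):
    """One layer with an assumed tanh non-linearity: tanh(W x), W given as rows."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def rcpn_forward(visual, tree, W_sem, W_com, W_dec, W_cat):
    """visual: list of leaf feature vectors; tree: parent -> (left, right),
    with ids assigned so that children always have smaller ids than parents."""
    n = len(visual)
    x = {i: dense(W_sem, visual[i]) for i in range(n)}        # Eq. (1)
    for p in sorted(tree):                                    # bottom-up, Eq. (2)
        left, right = tree[p]
        x[p] = dense(W_com, x[left] + x[right])               # list concatenation
    root = max(tree) if tree else 0
    enhanced = {root: x[root]}                                # root has no parent
    for p in sorted(tree, reverse=True):                      # top-down, Eq. (3)
        for c in tree[p]:
            enhanced[c] = dense(W_dec, x[c] + enhanced[p])
    probs = {}
    for i in range(n):                                        # Eq. (4)
        scores = [sum(w * v for w, v in zip(row, enhanced[i])) for row in W_cat]
        probs[i] = softmax(scores)
    return probs

def cross_entropy(probs, targets):
    """Mean cross-entropy over the leaves of one parse tree."""
    return -sum(math.log(probs[i][t]) for i, t in enumerate(targets)) / len(targets)

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

d_vis, d_sem, num_classes = 4, 3, 2
tree = {3: (0, 1), 4: (3, 2)}          # three super-pixels, node 4 is the root
visual = [[rng.random() for _ in range(d_vis)] for _ in range(3)]
probs = rcpn_forward(visual, tree,
                     rand_matrix(d_sem, d_vis), rand_matrix(d_sem, 2 * d_sem),
                     rand_matrix(d_sem, 2 * d_sem), rand_matrix(num_classes, d_sem))
loss = cross_entropy(probs, [0, 0, 1])
```

Averaging this per-tree loss over R random trees gives the quantity in Eq. (5).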

4. Proposed Approach 

In this section, we study the RCPN model, discover potential problems with parameter learning and propose useful modifications to the learning and the model. Our first modification tackles a potential pitfall during training that stems from the special architecture of RCPN and can reduce it to a simple multi-layer NN. The second modification extends the model by building an MRF on top of the parse trees to utilize the hierarchical dependency between the nodes.

4.1. Pure-node RCPN 

Here we propose a model that handles bypass errors. At the same time, this model addresses the problem of gradient attenuation and also multiplies the training data. For ease of understanding, all our discussions will be limited to 1-layer modules. This results in each of $W_{sem}$, $W_{com}$, $W_{dec}$ and $W_{cat}$ being a matrix. Like most deep networks, RCPN also suffers from vanishing gradients for the lower layers. This stems from the vanishing error signal, because the gradient ($g_l$) for the $l$-th layer depends on the error signal ($e_{l+1}$) from the layer above:

$$g_l = e_{l+1} x_l^T \qquad (6)$$

here, $x_l$ is the input to the $l$-th layer. For RCPN, vanishing gradients are more of a problem because of very deep parse trees due to recursion. For instance, a 100 super-pixel image will lead to a minimum of $\log_2(100) \times 2 + 2 > 14$ layers under the strong assumption of perfectly balanced binary parse trees. In practice, we can only create roughly balanced binary trees that often lead to approximately 30 layers.
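As a worked check of the depth estimate above, one reading of the count (an interpretation, not stated explicitly in the text) is one combiner layer and one decombiner layer per tree level, plus one layer each for the semantic mapper and the categorizer:

```latex
\underbrace{2 \log_2(100)}_{\text{combiner and decombiner levels}}
+ \underbrace{2}_{\text{mapper and categorizer}}
\approx 2 \times 6.644 + 2 \approx 15.29 > 14
```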







































We show that the internal nodes of the parse tree can be used to alleviate these problems. Each node in the parse tree corresponds to a connected region in the image. The leaf nodes correspond to the initial super-pixels and the internal nodes correspond to the merger of two or more connected regions, referred to as merged-regions. We use the term pure nodes to refer to the internal nodes of the parse tree associated with the merger of two or more regions of the same semantic category. Therefore, the merged-regions corresponding to the pure nodes can serve as additional labeled samples during training. We empirically found that roughly 65% of all the internal nodes are pure-nodes for all three datasets. We include the classification loss of the pure-nodes in the loss function (Eqn. 5) for training and refer to the new procedure as pure-node RCPN, or PN-RCPN for short. The classification loss now becomes

$$\mathcal{L}^{p}(I) = \mathcal{L}(I) + \frac{1}{\sum_{r} P_r} \sum_{r=1}^{R} \sum_{p=1}^{P_r} L(y_{r,p}, t_{r,p}; T_r, W_{rcpn}) \qquad (7)$$

here, $P_r$ is the number of pure-nodes for the $r$-th random parse tree $T_r$, and subscripts $(r, p)$ map to the $p$-th pure-node of the $r$-th random parse tree. Note that different parse trees for the same image can have different pure nodes.
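Identifying the pure nodes of a given parse tree only requires the ground-truth labels of the leaves. A minimal sketch, assuming each super-pixel carries a single ground-truth label and using a hypothetical tree encoding (parent id mapped to its two children, with children always numbered below their parent):

```python
def find_pure_nodes(tree, leaf_labels):
    """Return {internal_node_id: label} for every pure internal node.

    tree:        dict parent -> (left, right); child ids are smaller than parent ids.
    leaf_labels: ground-truth label of each leaf, indexed by leaf id.
    """
    label_sets = {i: {label} for i, label in enumerate(leaf_labels)}
    pure = {}
    for parent in sorted(tree):            # children are processed before parents
        left, right = tree[parent]
        label_sets[parent] = label_sets[left] | label_sets[right]
        if len(label_sets[parent]) == 1:   # all leaves below share one category
            pure[parent] = next(iter(label_sets[parent]))
    return pure

# Five super-pixels; nodes 5 to 8 are internal, node 8 is the root.
tree = {5: (0, 1), 6: (2, 3), 7: (5, 6), 8: (7, 4)}
pure = find_pure_nodes(tree, ["sky", "sky", "tree", "tree", "road"])
```

Here nodes 5 and 6 are pure and contribute extra labeled samples, while nodes 7 and 8 mix categories and are left out of the additional loss term.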

In order to understand the benefits of PN-RCPN and contrast it with RCPN, we make use of an illustrative example depicted in Fig. 2. The left half of a random parse tree for an image $I$ with 5 super-pixels, annotated with the variables involved during one forward-backward propagation through RCPN and PN-RCPN, is shown in Fig. 2a and 2b, respectively. We denote $e^{cat}$ (a $C \times 1$ vector) as the error at the enhanced super-pixel nodes; $e^{dec}$ (a $2d_{sem} \times 1$ vector) as the error at the decombiner; $e^{com}$ (a $2d_{sem} \times 1$ vector) as the error at the combiner and $e^{sem}$ (a $d_{sem} \times 1$ vector) as the error at the semantic mapper. Subscripts bp and total indicate the bypass error and the sum total error at a node, respectively. We assume a non-zero categorizer error signal for the first super-pixel only, i.e., $e^{cat}_i = 0$ for $i \neq 1$. These assumptions facilitate easier back-propagation tracking through the parse tree, but the conclusions drawn will hold for general cases as well.

The first obvious benefit of using pure-nodes is more labeled samples from the same training data, which can improve generalization. The second advantage of PN-RCPN can be understood by contrasting the back-propagation signals for a sample image for RCPN and PN-RCPN, with the help of Fig. 2a (RCPN) and 2b (PN-RCPN). Note that in the case of RCPN, the back-propagated training signal is generated at the enhanced leaf-node features and progressively attenuates as it back-propagates through the parse tree, shown with the help of variable-thickness solid red arrows. On the other hand, pure-node RCPN has an internal node (shown as a green node) that injects a strong error signal deep into the parse tree, resulting in stronger gradients even in the deeper layers. Moreover, PN-RCPN explicitly forces the combiner to learn a meaningful combination of two super-pixels, because incorrect classification of the combined features is penalized.

Now, we come to the third benefit of the PN-RCPN architecture. In what follows, we describe a subtle yet potentially serious problem related to RCPN learning, provide empirical evidence that this problem exists, and argue that PN-RCPN can offer a solution to this problem.

4.1.1 Understanding the Bypass Error 

During the minimization of the loss functions (Eqn. 5 or 7), parameters that are more effective in bringing down the objective function typically receive stronger gradients and reach their stable state early. Due to the presence of multiple layers of non-linearities and complex connections, the loss function is highly non-convex and the solution inevitably converges to a local minimum. It was shown in [21] that the combiner and decombiner assembly is the most important constituent of the RCPN model. Therefore, we expect the learning process to pay more attention to $W_{com}$ and $W_{dec}$. Unfortunately, the RCPN architecture introduces short-cut paths in the computation graph from the semantic mapper to the categorizer during the forward propagation that give rise to bypass errors during back-propagation. Bypass errors severely affect the learning by reducing the effect of the combiner on the overall loss function, thereby favoring an undesirable local minimum.

In order to understand the effect of bypass errors, we again make use of the example in Fig. 2 to show that bypass paths allow the back-propagated error signals from the categorizer ($e^{cat}$) to reach the semantic mapper through one layer only. On the other hand, $e^{cat}$ goes through multiple layers before reaching the combiner. Therefore, the gradient $g_{com}$ for the combiner is weaker than the gradient for the semantic mapper ($g_{sem}$).

Figure 2: Back-propagated error tracking to visualize the effect of bypass error. The variables follow the notation introduced in Sec. 3. Forward propagation and back-propagation are shown by solid black and red arrows, respectively. The attenuation of the error signal is shown by variable-width red arrows. The bypass errors are shown with dashed red arrows. (a) RCPN: the error signal from $\tilde{x}_1$ reaches $x_1$ in just one step, through the bypass path. (b) PN-RCPN introduces the pure-node classification loss (for $x_6$), thereby forcing the network to learn meaningful internal node representations via the combiner and promoting effective contextual propagation.

From Fig. 2a we can see that there are two possible paths for $e^{cat}$ to reach the combiner. One of them requires 2 layers ($\tilde{x}_1 \to \tilde{x}_6 \to x_6$) and the other requires 3 layers ($\tilde{x}_1 \to \tilde{x}_6 \to x_9 \to x_6$). Similarly, $e^{cat}$ can reach $x_1$ through a 1-layer bypass path ($\tilde{x}_1 \to x_1$) or a path of several layers through the parse tree. Due to gradient attenuation, the smaller the number of layers, the stronger the back-propagated signal; therefore, bypass errors lead to $g_{sem} > g_{com}$. This can potentially render the combiner network inoperative and guide the training towards a network that effectively consists of an $N_{sem} + N_{dec} + N_{cat}$ layer network from the visual feature ($v_i$) to the super-pixel label ($y_i$). This results in little or no contextual information exchange between the super-pixels. In the worst case $W_{dec} = [W \; 0]$; this removes the effect of parents on their children's features during top-down contextual propagation through the decombiner, thereby completely removing the effect of the combiner from RCPN. Practically, the random initialization of the parameters ensures that they will not converge to such a pathological solution. However, we show that a better local minimum can be achieved by tackling the bypass errors.
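The worst case described above can be illustrated numerically. The sketch below assumes a 1-layer decombiner with a tanh non-linearity and a child-then-parent input ordering; the function name `decombine` and all values are illustrative assumptions, not taken from [21]. When the block of the decombiner weights that multiplies the parent feature is zero, the enhanced child feature no longer depends on the parent at all.

```python
import math

def decombine(W_dec, child, parent):
    """Assumed 1-layer decombiner: tanh(W_dec [child; parent])."""
    z = child + parent                      # concatenate the two feature vectors
    return [math.tanh(sum(w * v for w, v in zip(row, z))) for row in W_dec]

W = [[0.5, -0.2], [0.1, 0.3]]               # weights acting on the child feature
W_dec = [row + [0.0, 0.0] for row in W]     # pathological case W_dec = [W 0]

child = [0.4, -0.7]
out_a = decombine(W_dec, child, [1.0, 2.0])     # one possible parent feature
out_b = decombine(W_dec, child, [-3.0, 0.5])    # a very different parent feature
```

Both calls return the same vector, so no context from the rest of the image reaches the super-pixel, which is the multi-layer NN behavior the text warns about.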

In order to see that $g_{sem} > g_{com}$, we compute the gradient strengths of each module ($g_{sem}$, $g_{com}$, $g_{dec}$, $g_{cat}$) during training. The gradient strengths of different modules for RCPN and PN-RCPN are normalized by the number of parameters and plotted in Fig. 3a and Fig. 3b, respectively. As expected, $g_{cat}$ is the strongest, because it is closest to the initial error signal. Surprisingly, for RCPN $g_{sem}$ is slightly stronger than $g_{dec}$ and significantly stronger than $g_{com}$ during the initial phase of training. Normally, we would expect $g_{sem}$, which is the farthest away from the error signal, to be the weakest due to vanishing gradients. This observation suggests that the initial training phase favors a multi-layer NN. However, we also observe that during the later stages of training, $g_{com}$ is comparable to the other gradients. Unfortunately, it has been conclusively established, by many empirical studies, that the initial phase of training is crucial for determining the final values of the network parameters, and thereby the network's performance [1]. From the figure we see that the combiner catches up with the other modules during later stages of training, but by then the parameters are already in the attraction basin of a poor solution.

Figure 3: Comparison of gradient strengths of different modules of (a) RCPN and (b) PN-RCPN during training.

On the other hand, the gradients for PN-RCPN (Fig. 3b) follow the natural order of strength, which gives more importance to the combiner and decombiner than the semantic mapper during the initial training. Fig. 2b provides an intuitive explanation by showing the categorizer error signal ($e^{cat}_6$) for $x_6$ that reaches the combiner through one layer only. To further investigate which of the three aforementioned benefits plays the biggest role in improving the performance of PN-RCPN over RCPN, we trained PN-RCPN on SIFT Flow under the same setting as Table 2, but we removed as many leaf node labels from the classification loss as the number of pure-nodes. This makes the number of labeled samples equal in both RCPN and PN-RCPN, but leaf-nodes are replaced with pure-nodes. As expected, it still improves the PPA and MCA scores for PN-RCPN (80.5% and 35.3%) vs. RCPN (79.6% and 33.6%). This last experiment confirms that the inclusion of pure-nodes not only provides more samples but also helps in overcoming the discussed shortcomings of RCPN.

4.2. Tree MRF Inference

The pure-node extension of RCPN provides the label distributions over merged-regions associated with the internal nodes in addition to individual super-pixel labels. In this section, we describe a Markov Random Field (MRF) structure to model the output label dependencies of the super-pixels while leveraging the internal node label distributions for hierarchical consistency. The proposed MRF uses the same tree structure as that of the parse trees used for RCPN inference. A factor graph representation of this MRF is shown in Figure 4. The variables $Y_i$ are $L$-dimensional binary label vectors associated with each region of the image, where $L$ is the number of possible labels. The $l$-th dimension of $Y_i$ is set according to the presence (1) or absence (0) of a super-pixel of the $l$-th class in the region.

Figure 4: Factor graph representation of the MRF model.

The unary potentials $f_1$ are given by the label distributions predicted by the RCPN and defined as

$$f_1(Y_i) = \frac{-Y_i^T \log(p_i)}{\|Y_i\|_1} \qquad (8)$$

where $p_i$ is the softmax output of the categorizer network for super-pixel $i$. If the probabilities given by RCPN are not degenerate, the unary potential prefers to assign a single label, that of the node with the highest probability.

The pairwise potentials $f_2$ are introduced to impose consistency between a pair of child and parent regions. The parent region must include all the labels assigned to its children regions, which is a hard constraint:

$$f_2(Y_i, Y_j) = \begin{cases} \infty, & \text{if } S(Y_i) \setminus S(Y_j) \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$

where node $j$ is the parent node of $i$ and $S(Y)$ is the set of all the labels in the merged-region with label vector $Y$.

The unary potentials $f_1$ utilize all levels of the tree simultaneously and prefer purer nodes, whereas the pairwise potentials $f_2$ enforce consistency across the tree hierarchy. This design allows for spatial smoothness at lower levels and mixed labeling at the higher levels. The tree structure of the MRF affords exact decoding using max-product belief propagation. The size of the state space is exponential in the number of labels. However, in practice there are rarely more than a handful of different object classes within an image. Therefore, to reduce the size of the state space, we first identify the different labels predicted by the RCPN and only retain the 9 most frequently occurring super-pixel labels per image.
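Exact decoding on the tree can be sketched with a min-sum recursion (max-product in the negative log domain) over label subsets encoded as bitmasks. The sketch below is illustrative only: it assumes the unary cost is the average negative log-probability of the labels in a subset, which is one reading of Eq. (8), it assumes all probabilities are strictly positive, and the function name and tree encoding are hypothetical.

```python
import math

def tree_mrf_decode(tree, probs, num_labels):
    """Min-sum decoding of label subsets on a binary parse tree.

    tree:  dict parent -> (left, right); child ids are smaller than parent ids.
    probs: dict node id -> list of strictly positive label probabilities.
    Returns dict node id -> bitmask of the labels assigned to that region.
    """
    states = range(1, 1 << num_labels)          # all non-empty label subsets

    def unary(node, s):
        labels = [l for l in range(num_labels) if (s >> l) & 1]
        return -sum(math.log(probs[node][l]) for l in labels) / len(labels)

    cost = {}         # cost[node][s]: best cost of the subtree at node, given state s
    best_child = {}   # best_child[(child, parent_state)]: optimal child state

    def solve(node):
        cost[node] = {s: unary(node, s) for s in states}
        for c in tree.get(node, ()):
            solve(c)
            for s in states:
                # Hard constraint of Eq. (9): child labels must be a subset of s.
                allowed = [t for t in states if (t & ~s) == 0]
                t_best = min(allowed, key=lambda t: cost[c][t])
                best_child[(c, s)] = t_best
                cost[node][s] += cost[c][t_best]

    root = max(tree)
    solve(root)
    assign = {root: min(states, key=lambda s: cost[root][s])}
    for p in sorted(tree, reverse=True):        # backtrack from the root down
        for c in tree[p]:
            assign[c] = best_child[(c, assign[p])]
    return assign

tree = {3: (0, 1), 4: (3, 2)}                   # node 4 is the root
probs = {0: [0.9, 0.1], 1: [0.8, 0.2], 2: [0.3, 0.7],
         3: [0.85, 0.15], 4: [0.5, 0.5]}
assign = tree_mrf_decode(tree, probs, num_labels=2)
```

In this toy example the two left leaves and their parent receive label 0 (bitmask 1), the right leaf receives label 1 (bitmask 2) and the root carries both labels (bitmask 3). The cost per edge grows with the square of the number of subsets, which is why restricting the label set per image matters.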


5. Experimental analysis 

In this section we evaluate the performance of the proposed methods for semantic segmentation on three different datasets: Stanford Background, SIFT Flow and Daimler Urban. The Stanford Background dataset contains 715 color images of outdoor scenes; it has 8 classes and the images are approximately 240 x 320 pixels. We used the 572 train and 143 test image split provided by [23] for reporting the results. SIFT Flow contains 2688 color images of 256 x 256 pixels with 33 semantic classes. We experimented with the train/test (2488/200) split provided by the authors of [25]. The Daimler Urban dataset has 500 images of 400 x 1024 pixels captured from a moving car in a city; it has 5 semantic classes. We trained the model using 300 images and tested on the remaining 200 images; the same split ratio has been used by previous work on this dataset.

5.1. Visual feature extraction 

We use a Multi-scale convolutional neural network (Multi-scale CNN) [2] to extract pixel-wise features using the publicly available library Caffe [8]. We follow [21] and use the same CNN structure with similar preprocessing (subtracting 0.5 from each channel at each pixel location in the RGB color space) at 3 different scales (1, 1/2 and 1/4) to obtain the visual features. The CNN architecture has three convolutional stages with an 8 x 8 x 16 conv -> 2 x 2 maxpool -> 7 x 7 x 64 conv -> 2 x 2 maxpool -> 7 x 7 x 256 conv configuration; each max-pooling is non-overlapping. Therefore, every image scale gives a 256-dimensional output map. The outputs from each scale are concatenated to get the final feature map. Note that the 256 x 3 = 768 dimensional concatenated output feature map is still 1/4th of the height and width of the input image due to the max-pooling operations. In order to obtain a per-pixel feature map at the input size, we simply scale up each feature map by a factor of 4 in height and width using bilinear interpolation.

We use the publicly available implementation of [13] to obtain 100 (same as RCPN) and 800 super-pixels per image for SIFT Flow and Daimler Urban, respectively. Daimler uses more super-pixels due to its larger image size. For Stanford Background, we have used the super-pixels provided by [23].
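The per-pixel features are turned into one visual feature vector per super-pixel by averaging over the pixels of each segment. A minimal sketch, assuming a flattened image and integer segment ids (the function name and data layout are illustrative assumptions):

```python
def average_over_superpixels(pixel_features, segment_ids, num_segments):
    """Average per-pixel feature vectors inside each super-pixel.

    pixel_features: list of feature vectors, one per pixel (flattened image).
    segment_ids:    super-pixel id of each pixel, in the same order.
    Returns one averaged feature vector per super-pixel.
    """
    dim = len(pixel_features[0])
    sums = [[0.0] * dim for _ in range(num_segments)]
    counts = [0] * num_segments
    for feature, sid in zip(pixel_features, segment_ids):
        counts[sid] += 1
        for k, value in enumerate(feature):
            sums[sid][k] += value
    # Empty segments (count 0) keep an all-zero feature vector.
    return [[v / c for v in row] if c else row for row, c in zip(sums, counts)]

features = average_over_superpixels(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],   # three pixels, 2-D features
    [0, 0, 1],                              # first two pixels share a super-pixel
    num_segments=2,
)
```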

5.2. Model Selection 

Unlike most of the previous works that rely on careful hand-tuning and expert knowledge for setting the model parameters, we only need to set one parameter, namely $d_{sem}$, after we have fixed the modules to be 1-layer neural networks. This affords a generic approach to semantic segmentation that can be easily trained on different datasets. For the sake of strict comparison with the original RCPN architecture, we also use 1-layer modules with $d_{sem} = 60$ in all our experiments. Plain-NN refers to training a 2-layer NN with 60 hidden nodes on top of the visual features for each super-pixel. RCPN refers to the original RCPN model [21]. PN-RCPN refers to pure-node RCPN and TM-RCPN refers to tree-MRF RCPN.

5.3. Evaluation metrics 

We have used four standard evaluation metrics:

• Per pixel accuracy (PPA): Ratio of the correct pixels to the total pixels in the test images, while ignoring the background.

• Mean class accuracy (MCA): Mean of the category-wise pixel accuracy.

• Intersection over Union (IoU): Ratio of true positives to the sum of true positives, false positives and false negatives, averaged over all classes. This is a popular measure for semantic segmentation of objects because it penalizes both over- and under-segmentation.

• Time per image (TPI): Time required to label an image on GPU and CPU.

The results from previous works are taken directly from the published articles. Some of the previous works do not report all four evaluation metrics; we mark the corresponding entries as NA in the comparison tables.
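The three accuracy metrics can all be computed from a pixel-level confusion matrix. The sketch below is illustrative: it assumes background pixels were already excluded when the matrix was accumulated, and it averages MCA and IoU over the classes that occur in the ground truth, which is an assumption about a detail the definitions above leave open.

```python
def segmentation_metrics(conf):
    """Compute (PPA, MCA, IoU) from a confusion matrix.

    conf[t][p] is the number of pixels with ground-truth class t predicted as p.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    tp = [conf[k][k] for k in range(n)]                           # true positives
    gt = [sum(conf[k]) for k in range(n)]                         # pixels per true class
    pred = [sum(conf[t][k] for t in range(n)) for k in range(n)]  # pixels per predicted class
    present = [k for k in range(n) if gt[k] > 0]

    ppa = sum(tp) / total
    mca = sum(tp[k] / gt[k] for k in present) / len(present)
    # Union = TP + FP + FN = gt + pred - TP.
    iou = sum(tp[k] / (gt[k] + pred[k] - tp[k]) for k in present) / len(present)
    return ppa, mca, iou

ppa, mca, iou = segmentation_metrics([[8, 2],
                                      [1, 9]])
```

For this two-class example PPA and MCA are both 0.85, while the mean IoU is lower (about 0.739) because each error counts against both the true and the predicted class.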

5.4. Stanford Background 

We report our results with CNN features extracted from the original scale only, because multi-scale CNN features overfit, perhaps due to the small training data, as observed in [21]. We use 10 and 40 random trees for training and testing, respectively. The results are shown in Table 1. From the comparison, it is clear that our proposed approaches outperform previous methods. We observe that PN-RCPN significantly improves the results in terms of MCA and IoU over RCPN. We observe a marginal improvement offered by TM-RCPN over PN-RCPN.

Table 1: Stanford background result.

| Method | PPA | MCA | IoU | TPI (s), CPU/GPU |
|---|---|---|---|---|
| Gould, [ ] | 76.4 | NA | NA | 30 - 600 / NA |
| Munoz, [16] | 76.9 | NA | NA | 12 / NA |
| Tighe, [25] | 77.5 | NA | NA | 4 / NA |
| Kumar, [9] | 79.4 | NA | NA | < 600 / NA |
| Socher, [23] | 78.1 | NA | NA | NA / NA |
| Lempitzky, [11] | 81.9 | 72.4 | NA | > 60 / NA |
| Singh, [22] | 74.1 | 62.2 | NA | 20 / NA |
| Farabet, [2] | 81.4 | 76.0 | NA | 60.5 / NA |
| Eigen, [4] | 75.3 | 66.5 | NA | 16.6 / NA |
| Pinheiro, [18] | 80.2 | 69.9 | NA | 10 / NA |
| Plain-NN | 80.1 | 69.7 | 56.4 | 1.1 / 0.4 |
| RCPN [21] | 81.8 | 73.9 | 61.3 | 1.1 / 0.4 |
| PN-RCPN | 82.1 | 79.0 | 64.0 | 1.1 / 0.4 |
| TM-RCPN | 82.3 | 79.1 | 64.5 | 1.6-6.1 / 0.9-5.9 |

5.5. SIFT Flow 

We report our results using multi-scale CNN features at three scales (1, 1/2 and 1/4), as in [21]. Some of the classes in the SIFT Flow dataset have a very small number of training instances; therefore, we also trained with balanced sampling to compensate for rare occurrence, indicated by the bal. prefix. We use 4 and 20 random trees for training and testing, respectively. The results for the SIFT Flow dataset are shown in Table 2. PN-RCPN led to a significant improvement in all three measures over RCPN, and balanced training led to a significant boost in MCA. The use of TM-RCPN does not affect the results much compared to PN-RCPN. We observe a strong trade-off between PPA and MCA on this dataset. Our overall best model in terms of both PPA and MCA (bal. TM-RCPN) is comparable to the work in [27]; PPA: 76.4 vs. 79.8, MCA: 52.6 vs. 48.7.

5.6. Daimler Urban 

We report our results using multi-scale CNN features with balanced training in Table 3. The previous results are based on the predicted labels provided by the authors of [19]. The authors, in their paper [19], have reported the results with background as one of the classes, but the ground-truth labels for this dataset have portions of foreground classes labeled as the background. Therefore, even a correct segmentation is penalized. We ignore the background class while reporting the results for a fair evaluation.















Table 2: SIFT Flow result.

| Method | PPA | MCA | IoU | TPI (s), CPU/GPU |
|---|---|---|---|---|
| Tighe, [25] | 77.0 | 30.1 | NA | 8.4 / NA |
| Liu, [12] | 76.7 | NA | NA | 31 / NA |
| Singh, [22] | 79.2 | 33.8 | NA | 20 / NA |
| Eigen, [4] | 77.1 | 32.5 | NA | 16.6 / NA |
| Farabet, [2] | 78.5 | 29.6 | NA | NA / NA |
| (Balanced), [2] | 72.3 | 50.8 | NA | NA / NA |
| Tighe, [24] | 78.6 | 39.2 | NA | > 8.4 / NA |
| Pinheiro, [18] | 77.7 | 29.8 | NA | NA / NA |
| Yang, [27] | 79.8 | 48.7 | NA | < 12 / NA |
| Plain-NN | 76.3 | 32.1 | 24.7 | 1.1 / 0.36 |
| RCPN, [21] | 79.6 | 33.6 | 26.9 | 1.1 / 0.4 |
| bal. RCPN, [21] | 75.5 | 48.0 | 28.6 | 1.1 / 0.4 |
| PN-RCPN | 80.9 | 39.1 | 30.8 | 1.1 / 0.4 |
| bal. PN-RCPN | 75.5 | 52.8 | 30.2 | 1.1 / 0.4 |
| TM-RCPN | 80.8 | 38.4 | 30.7 | 1.6-6.1 / 0.9-5.4 |
| bal. TM-RCPN | 76.4 | 52.6 | 31.4 | 1.6-6.1 / 0.9-5.8 |


IoU Dyn is the IoU for dynamic objects, i.e., cars,
pedestrians and bicyclists. We would like to underscore that
the previous approaches ([10, 19]) use stereo, depth, visual
odometry and multi-frame temporal information, which relies
on the fact that the images come from a moving vehicle,
whereas we use only a single independent image and still
obtain similar or better performance. We observe significant
improvements in terms of IoU with the use of PN-RCPN
over RCPN and Plain-NN, which could be due to the
well-structured image semantics of this dataset, allowing
the model to learn the structure very effectively and to
utilize the context much better than on the other two
datasets. Some representative segmentation results are
shown in Fig. 5. We have also included a complete video of
the semantic segmentation for all the Daimler Urban test
images in the supplementary material.
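
The effect of ignoring the background class and of restricting IoU to dynamic classes can be sketched as follows. This is a minimal illustration under assumptions: the label ids, the helper names, and the toy label sequences are hypothetical and do not reflect the actual Daimler Urban label encoding or our evaluation code.

```python
# Hypothetical label ids, for illustration only.
BACKGROUND, GROUND, VEHICLE, PEDESTRIAN = 0, 1, 2, 3
DYNAMIC_CLASSES = [VEHICLE, PEDESTRIAN]


def confusion_matrix(truth, pred, num_classes, ignore_label=None):
    """Count (true, predicted) label pairs, skipping pixels whose
    ground-truth label equals ignore_label."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        if t == ignore_label:
            continue
        conf[t][p] += 1
    return conf


def mean_iou(conf, classes):
    """Average intersection over union across the given classes,
    skipping classes that never occur in truth or prediction."""
    ious = []
    for c in classes:
        tp = conf[c][c]
        union = sum(conf[c]) + sum(row[c] for row in conf) - tp
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy pixels: the first two are labeled background in the ground truth,
# although the first one is predicted (possibly correctly) as a vehicle.
truth = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
pred = [2, 1, 1, 1, 2, 2, 2, 3, 1, 3]

conf_all = confusion_matrix(truth, pred, num_classes=4)
conf_ign = confusion_matrix(truth, pred, num_classes=4,
                            ignore_label=BACKGROUND)

iou_dyn_all = mean_iou(conf_all, DYNAMIC_CLASSES)
iou_dyn_ign = mean_iou(conf_ign, DYNAMIC_CLASSES)
print(f"IoU Dyn with background counted: {iou_dyn_all:.3f}")
print(f"IoU Dyn with background ignored: {iou_dyn_ign:.3f}")
```

When background pixels are counted, a vehicle prediction on a pixel labeled as background enlarges the union for the vehicle class and lowers its IoU. Skipping such pixels removes that penalty.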


Table 3: Daimler result. Numbers in italics indicate the use
of stereo, depth and multi-frame temporal information.

Method            PPA    MCA    IoU    IoU Dyn   TPI (s) CPU/GPU
Joint, [10, 19]   94.5   91.0   86.0   74.5      111/NA
Stix., [19]       92.8   87.5   80.6   72.3      0.05/NA
bal. Plain-NN     91.4   83.2   75.8   56.2      5.9/2.8
bal. RCPN         93.3   87.6   80.9   66.0      6.0/2.8
bal. PN-RCPN      94.5   90.2   84.5   73.8      6.0/2.8
bal. TM-RCPN      94.5   90.1   84.5   73.8      12/8.8



Figure 5: Some representative image segmentation results
on the Daimler Urban dataset. Here, CNN refers to direct
per-pixel classification resulting from the multi-scale CNN.
The ground-truth images are only partially labeled, and we
have marked the unlabeled pedestrians with yellow ellipses.


5.7. Segmentation Time

In this section we provide the timing details for the
experiments. Only the Multi-CNN feature extraction is
executed on a GPU for our Plain-NN and RCPN variants.
Due to similar image sizes, SIFT Flow and Stanford
Background took almost the same computation time per
image, except when using TM-RCPN, because of the
difference in label state-space size. The time break-up for
SIFT Flow (same for Stanford Background) in seconds is
0.3 (super-pixellation) + 0.08/0.8 (GPU/CPU visual feature)
+ 0.01 (PN-RCPN) + 0.5-5 (TM-MRF). For Daimler, the
corresponding timings are 2.4 + 0.4/3.5 + 0.09 + 6 seconds.
Therefore, the bottleneck for our system is the
super-pixellation time for PN-RCPN and MRF inference for
TM-RCPN. Fortunately, there are real-time super-pixellation
algorithms, such as [3], that can help us achieve
state-of-the-art semantic segmentation within 100
milliseconds on an NVIDIA Titan Black GPU.

6. Conclusion


We analyzed the recursive context propagation network,
referred to as RCPN [21], and discovered potential
problems with the learning of its parameters. Specifically,
we showed the existence of bypass errors and explained
how they can reduce the RCPN model to what is effectively
a multi-layer neural network for each super-pixel. Based on
our findings, we proposed adding the classification loss of
pure nodes to the original RCPN formulation and
demonstrated its benefits in terms of avoiding the bypass
errors. We also proposed a tree MRF on the parse tree
nodes to utilize the pure nodes' label estimates for inferring
the super-pixel labels. The proposed approaches lead to
state-of-the-art performance on three segmentation
datasets: Stanford Background, SIFT Flow and Daimler Urban.
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